Generation of Bilingual Dictionaries using Comparable and Quasi Comparable Corpora
Abstract
The amount of information available on the web is increasing rapidly, and the number of internet users grows every day. A significant share of internet users is monolingual: they want to express themselves in their native language and to seek information in it as well. Hence, multilingual content on the internet is also growing at a rapid pace. There is a need for systems that let such users express themselves in the language of their choice and that answer their information requests irrespective of the language barrier. Cross-lingual term associations are very important for many interlingual applications. Machine translation (MT) is a sub-field of computational linguistics that aims at translating text from one language to another. All MT systems depend, at their core, on bilingual dictionaries. A bilingual dictionary is a resource whose entries are word translations, and such dictionaries are important for NLP applications such as statistical machine translation and cross-language information extraction. They can also serve to enhance existing dictionaries and to support second-language teaching and learning. Manually created resources are usually more accurate and, in contrast to automatically learned dictionaries, do not contain noisy information. The scientific community seeks methods that achieve a similar level of accuracy and a broader terminological scope by automatic means. In this thesis, we address the problem of generating bilingual dictionaries automatically. We propose two different approaches to generating bilingual dictionaries for the English-Hindi pair. Both approaches are language independent and hence can be used to build dictionaries for other language pairs that are phonetically rich. Hindi is heavily under-represented on the web for many technical and socio-cultural reasons.
Many languages, like Hindi, suffer from the limited availability of language resources and tools. It is crucial to develop a language-independent approach for such languages. In this thesis, although we report experiments for the English-Hindi language pair, our approach can easily be extended to other language pairs as well. We did not use any language-specific resource in the approaches proposed here. We have chosen two of the most
Similar resources
Automatic Generation of Bilingual Dictionaries Using Intermediary Languages and Comparable Corpora
This paper outlines a strategy to build new bilingual dictionaries from existing resources. The method is based on two main tasks: first, a new set of bilingual correspondences is generated from two available bilingual dictionaries. Second, the generated correspondences are validated by making use of a bilingual lexicon automatically extracted from non-parallel and comparable corpora. The qual...
Enrichment of Bilingual Dictionary through News Stream Data
Bilingual dictionaries are a key component of cross-lingual similarity estimation methods. Usually such dictionary generation is accomplished by manual or automatic means. Automatic generation approaches include exploiting parallel or comparable data to derive dictionary entries. Such approaches require a large amount of bilingual data in order to produce a good-quality dictionary. Many time ...
Extracting Bilingual Persian Italian Lexicon from Comparable Corpora Using Different Types of Seed Dictionaries
Ebrahim Ansari et al. 2017. Extracting bilingual Persian Italian lexicon from comparable corpora using different types of seed dictionaries. In "Applications of Comparable Corpora", edited book, Berlin Linguistic Press (ed.). Bilingual dictionaries are very important in various fields of natural language processing. In recent years, research on extracting new bilingual lex...
Sentence Alignment in Parallel, Comparable, and Quasi-comparable Corpora
We explore the usability of different bilingual corpora for the purpose of multilingual and cross-lingual natural language processing. The usability of a bilingual corpus is evaluated by the lexical alignment score calculated for the bi-lexicon pairs distributed in the aligned bilingual sentence pairs. We compare and contrast a number of bilingual corpora, ranging from parallel, to comparable, and...
A Combination of Models for Bilingual Lexicon Extraction from Comparable Corpora
In this paper we present a method to extract bilingual terminologies from comparable non-aligned corpora, by using multiple linguistic knowledge sources, such as: non-parallel corpora, bilingual thesauri, a preliminary bilingual dictionary, etc... We focus on two core technologies: bilingual lexicon extraction from comparable corpora and expansion through thesauri categories based on different ...
Bilingual Dictionary Extraction from Wikipedia
The way of mining comparable corpora and the strategy of dictionary extraction are two essential elements of bilingual dictionary extraction from comparable corpora. This paper first proposes a method, which uses the interlanguage link in Wikipedia, to build comparable corpora. The large scale of Wikipedia ensures the quantity of collected comparable corpora. Besides, because the inter-language...
Publication date: 2016